Evaluating batch correction methods for image-based cell profiling

High-throughput image-based profiling platforms are powerful technologies capable of collecting data from billions of cells exposed to thousands of perturbations in a time- and cost-effective manner. Therefore, image-based profiling data has been increasingly used for diverse biological applications, such as predicting drug mechanism of action or gene function. However, batch effects severely limit community-wide efforts to integrate and interpret image-based profiling data collected across different laboratories and equipment. To address this problem, we benchmark ten high-performing single-cell RNA sequencing (scRNA-seq) batch correction techniques, representing diverse approaches, using a newly released Cell Painting dataset, JUMP. We focus on five scenarios with varying complexity, ranging from batches prepared in a single lab over time to batches imaged using different microscopes in multiple labs. We find that Harmony and Seurat RPCA are noteworthy, consistently ranking among the top three methods for all tested scenarios while maintaining computational efficiency. Our proposed framework, benchmark, and metrics can be used to assess new batch correction methods in the future. This work paves the way for improvements that enable the community to make the best use of public Cell Painting data for scientific discovery.


Supplementary Figure 2: Evaluation Scenario 1. A) Quantitative comparison of ten batch correction methods measuring batch effect removal (four batch correction metrics) and conservation of biological variance (six bio-metrics). Metrics are mean-aggregated by category. The overall score is the weighted sum of the aggregated batch correction and bio-metrics, with weights of 0.4 and 0.6, respectively. Visualization of integrated data colored by B) Compound, and C) Batch. The left-to-right layout reflects the methods' descending order of performance. We selected 18 out of 302 compounds with replicates in different well positions to account for position effects that may cause profiles to look similar. Alphanumeric IDs denote positive controls. Source data are provided as a Source Data file.

Supplementary Figure 5: Evaluation Scenario 5. A) Quantitative comparison of ten batch correction methods measuring batch effect removal (four batch correction metrics) and conservation of biological variance (six bio-metrics). Metrics are mean-aggregated by category. The overall score is the weighted sum of the aggregated batch correction and bio-metrics, with weights of 0.4 and 0.6, respectively. Visualization of integrated data colored by B) Source, and C) Microscope. The left-to-right layout reflects the methods' descending order of performance. Source data are provided as a Source Data file.
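
The overall score described in these captions can be illustrated with a minimal Python sketch. The metric names and values below are hypothetical placeholders, not results from this study; only the mean aggregation by category and the 0.4/0.6 weighting are taken from the captions.

```python
from statistics import mean

# Weights stated in the figure captions.
BATCH_WEIGHT = 0.4
BIO_WEIGHT = 0.6


def overall_score(batch_metrics, bio_metrics):
    """Weighted sum of the mean-aggregated batch correction and bio-metrics."""
    batch_score = mean(batch_metrics.values())  # aggregate the batch category
    bio_score = mean(bio_metrics.values())      # aggregate the bio category
    return BATCH_WEIGHT * batch_score + BIO_WEIGHT * bio_score


# Hypothetical metric names and values, for illustration only.
example_batch = {"batch_metric_1": 0.8, "batch_metric_2": 0.6,
                 "batch_metric_3": 0.7, "batch_metric_4": 0.9}
example_bio = {"bio_metric_1": 0.5, "bio_metric_2": 0.4, "bio_metric_3": 0.6,
               "bio_metric_4": 0.5, "bio_metric_5": 0.3, "bio_metric_6": 0.7}

score = overall_score(example_batch, example_bio)
print(round(score, 3))
```

Because the bio-metrics category carries the larger weight, a method that removes batch effects aggressively but erases biological signal is penalized more than one that does the opposite.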

Isolated compounds performance
Around 30% of the compounds in Scenario 3 are present in all three sources (sources 2, 6, and 10). We used this scenario to assess the replicate retrieval performance of sub-populations of compounds that are not shared between different batches (i.e., sources, in this setup). For the evaluation, we used the corrected profiles from the best-performing correction method in this scenario, Seurat CCA. We picked the 10,136 compounds that are present in sources 2 and 6 but not in source 10 (i.e., they are isolated to sources 2 and 6). We compared the performance of this sub-population (labeled "two sources" in Sup Figure 6) with that of a sub-population of 23,782 compounds present in all three sources (labeled "three sources" in Sup Figure 6). We then computed the mAP (control) score for each sub-population, using only the replicates from sources 2 and 6 and ignoring the replicates from source 10. We observed that the compounds that belong exclusively to two sources performed better than the compounds present in all three sources, which contradicts the over-correction hypothesis. A likely explanation is that the correction task becomes more difficult as more sources must be aligned.
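
The selection of the two sub-populations can be sketched as a set-membership partition. This is a minimal illustration, not the code used in the study: the function name, the record layout, and the compound IDs are hypothetical, and the mAP computation itself is not shown.

```python
from collections import defaultdict


def split_by_source_membership(records, kept_sources, excluded_source):
    """Partition compounds by the sources in which they appear.

    records: iterable of (compound_id, source_id) pairs.
    Returns (isolated, shared), where
      isolated = compounds in every kept source but not in the excluded one,
      shared   = compounds in every kept source and in the excluded one.
    """
    sources_by_compound = defaultdict(set)
    for compound, source in records:
        sources_by_compound[compound].add(source)

    kept = set(kept_sources)
    isolated, shared = set(), set()
    for compound, sources in sources_by_compound.items():
        if not kept <= sources:
            continue  # missing from at least one kept source
        if excluded_source in sources:
            shared.add(compound)
        else:
            isolated.add(compound)
    return isolated, shared


# Toy records with made-up compound IDs, for illustration only.
toy_records = [
    ("cpd_A", 2), ("cpd_A", 6),
    ("cpd_B", 2), ("cpd_B", 6), ("cpd_B", 10),
    ("cpd_C", 2), ("cpd_C", 10),
]
two_sources, three_sources = split_by_source_membership(toy_records, [2, 6], 10)
```

In this toy example, `cpd_A` falls in the "two sources" group and `cpd_B` in the "three sources" group, while `cpd_C` is excluded because it is absent from source 6. For a like-for-like comparison, the retrieval score of both groups would then be computed from the source 2 and source 6 replicates only, as described above.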

Supplementary Figure 3: Evaluation Scenario 2. A) Quantitative comparison of ten batch correction methods measuring batch effect removal (four batch correction metrics) and conservation of biological variance (six bio-metrics). Metrics are mean-aggregated by category. The overall score is the weighted sum of the aggregated batch correction and bio-metrics, with weights of 0.4 and 0.6, respectively. Visualization of integrated data colored by B) Compound, and C) Source. The left-to-right layout reflects the methods' descending order of performance. We selected 18 out of 302 compounds with replicates in different well positions to account for position effects that may cause profiles to look similar. Alphanumeric IDs denote positive controls. Source data are provided as a Source Data file.

Supplementary Figure 6: Comparison of the replicate retrieval performance (mAP) of sub-populations of compounds that are not shared between different batches. The sub-population present in only two sources (n=10,136 compounds) performed better than the sub-population present in three sources (n=23,782 compounds). Data extracted from Scenario 3. Source data are provided as a Source Data file.

Supplementary Figure 1: Preprocessing mAP scores in Scenario 1 (n=8,064 wells). Every combination is encoded in the name as follows. mad: median absolute deviation normalization; clip: clip outlier values to 500; drop: drop any column with an outlier value; imputemedian: impute outliers with the median value; imputeknn: impute outlier values with KNN; featselect: feature selection using the variance threshold and correlation threshold operations from PyCytominer [1]; int: rank-based inverse normal transformation. negcon and nonrep represent scores for replicability [2]. Source data are provided as a Source Data file.

Runtime analysis
We measured the runtime for non-GPU methods and metrics across the five scenarios using a c6i.16xlarge AWS EC2 instance equipped with 64 cores and 128 GB of RAM. A log-log plot of the results (Sup Figure 7) suggests a power-law relationship between runtime and sample size. Extrapolating this trend reveals that applying Harmony, one of the top-performing methods, at the single-cell level (Sup Table 1) would be prohibitively time-consuming. Processing a single plate would take approximately 2.6 hours, a single batch would require 33 hours, and a single source (out of 13 in the full JUMP Cell Painting dataset) would take 11 days. Moreover, loading a single source would require 2.7 TB of memory. Similarly, the runtime extrapolation of Seurat CCA estimates that processing the entire JUMP dataset at the well level, containing approximately 890,000 well-level profiles, would take 17 days.
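
The kind of extrapolation used in the runtime analysis can be sketched as a linear fit in log-log space. This is a minimal illustration assuming a model of the form runtime = c * n^k; the function names are hypothetical, and the timings are synthetic values generated from a known power law, not the measurements reported here.

```python
import numpy as np


def fit_power_law(sample_sizes, runtimes):
    """Fit runtime = c * n**k by least squares on log-transformed data."""
    # A power law is a straight line in log-log space:
    # log(runtime) = k * log(n) + log(c)
    k, log_c = np.polyfit(np.log(sample_sizes), np.log(runtimes), deg=1)
    return float(np.exp(log_c)), float(k)


def extrapolate(c, k, n):
    """Predict the runtime at sample size n from the fitted power law."""
    return c * n ** k


# Synthetic timings (arbitrary units), generated from c=2e-3 and k=1.2
# purely so that the fit can be checked against known parameters.
sizes = np.array([1e3, 1e4, 1e5])
times = 2e-3 * sizes ** 1.2

c, k = fit_power_law(sizes, times)
predicted = extrapolate(c, k, 1e6)  # one decade beyond the largest size
```

Extrapolating far beyond the measured range assumes that the same exponent keeps holding, so estimates of this kind are best read as orders of magnitude rather than precise predictions.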

Table 1: Count of single cells at different levels in the JUMP CP Dataset.

Supplementary Figure 8: Scatter plot of the mean batch correction and bio-metrics scores for all the tested methods across the five scenarios, reflecting the increasing difficulty of the scenarios. Bars represent one standard deviation along the respective axis. Source data are provided as a Source Data file.

Supplementary Figure 9: Comparison of the best and worst batch correction methods, reflecting the variability of performance with respect to the complexity of the scenarios (scenarios are sorted by overall mean score). Source data are provided as a Source Data file.